Abstract

Sparse representation based classification (SRC) is widely used in applications
such as face recognition and is implemented in two steps: representation coding
and classification. For a given set of testing images, SRC codes every image
over the base images as a sparse representation and then assigns it to the class
with the least representation error. This scheme uses individual representations
rather than a collective one to classify such a set of images, and therefore
ignores the correlation among the given images. In this paper, a joint
representation classification (JRC) for collective face recognition is proposed.
JRC takes the correlation of multiple images as well as each single
representation into account. Under the assumption that the given face images are
generally related to each other, JRC codes all the testing images over the base
images simultaneously to facilitate recognition. To this end, the testing inputs
are aligned into a matrix and the joint representation coding is formulated as a
generalized $\ell_{2,q}$-$\ell_{2,p}$-minimization problem. To uniformly solve the
induced optimization problems for any $q \in [1,2]$ and $p \in (0,2]$, an
iterative quadratic method (IQM) is developed. IQM is proved to be a strict
descent algorithm with convergence to the optimal solution. Moreover, a more
practical IQM is proposed for the large-scale case. Experimental results on three
public databases show that JRC with the practical IQM not only saves much
computational cost but also achieves better performance in collective face
recognition than the state of the art.

Keywords: SRC; JRC; IQM; practical IQM. 


1 Introduction 

Recently, representation coding based classification and its variants have been
developed for face image recognition (FR) [1-5]. These schemes have achieved great
success in FR and boosted the applications of image classification [6, 7]. Sparse
representation based classification (SRC) [1] is the best known one; it directly
uses the sparse code for classification and efficiently recognizes the class
giving the most compact representation. The main idea can be summarized in two
steps: 1) coding a testing sample as a linear combination of all the training
samples, then 2) classifying the testing sample to the most compact class by
evaluating coding errors. Typical SRC employs the following $\ell_1$-minimization
as the sparse representation model,

$$\min_{x} \|x\|_1 \quad \text{s.t.} \quad \|y - Ax\|_2 \le \varepsilon, \tag{1}$$

where $A \in \mathbb{R}^{m \times d}$ is the dictionary of coding atoms and
$y \in \mathbb{R}^m$ is a given observation. $x \in \mathbb{R}^d$ is the coding
vector and $\varepsilon > 0$ denotes the noise level. SRC outputs the identity of
$y$ as

$$\text{identity}(y) = \arg\min_{1 \le i \le I} \{\|y - A x_i^*\|_2\}, \tag{2}$$

where $I$ denotes the number of classes and $x_i^*$ is the coding coefficient
vector associated with class $i$. The experimental results reported in [1] show
that the SRC scheme achieves impressive performance. However, the authors of [2]
argued that SRC overemphasizes the importance of $\ell_1$-norm sparsity and
ignores the effect of collaborative representation. Consequently, a collaborative
representation based classification with regularized least squares (CRC-RLS) was
presented in [2] for face recognition,

$$\min_{x} \|x\|_2 \quad \text{s.t.} \quad \|y - Ax\|_2 \le \varepsilon. \tag{3}$$

Problem (3) is easier to solve than (1) because of its smoothness. Models (1) and
(3) can be considered as least squares problems with different regularizers,

$$\min_{x} \|y - Ax\|_2^2 + \lambda\|x\|_1 \quad \text{and} \quad \min_{x} \|y - Ax\|_2^2 + \lambda\|x\|_2^2. \tag{4}$$

Moreover, Wright et al. [3] used an $\ell_1$-norm fidelity term to improve the
coding fidelity of $y$ over $A$,

$$\min_{x} \|y - Ax\|_1 + \lambda\|x\|_1. \tag{5}$$

Actually, models (1) and (3)-(5) can all be included in the framework

$$\min_{x} \|y - Ax\|_q^q + \lambda\|x\|_p^p, \quad 1 \le q \le 2, \ 0 < p \le 2. \tag{6}$$

In (6), the representation and regularization measurements are extended to
$\|\cdot\|_q$ ($1 \le q \le 2$) and $\|\cdot\|_p$ ($0 < p \le 2$) respectively.
This modification makes it possible to adaptively choose the most suitable model
for different applications. Moreover, computational experience [13-15] has shown
that the fractional norm $\ell_p$ ($0 < p < 1$) yields sparser patterns than the
$\ell_1$-norm. The unified formulation (6) is therefore expected to achieve better
performance. On the other hand, model (6) is a vector representation based
framework, which implies the following weaknesses.

• Model (6) uses a coding vector to represent testing samples one by one. In many
face recognition tasks, a great number of images of each known subject have been
collected from video sequences or photo albums. Face recognition then has to be
conducted with a set of probe images rather than a single one [8]. In this case,
representation coding based classification like model (6) cannot work efficiently.

• Every testing sample is coded independently of the others in (6). This approach
takes no account of the correlation hidden in the image set. The difference and
similarity between multiple pictures are totally ignored. It is well known that
collective faces share some similar feature patterns; for instance, eye or mouth
pixels are more powerful in discrimination than those of the forehead or cheek.

• When $q, p$ in (6) take different values, the involved optimization problems have
to be solved by different algorithms. For example, (1) is solved by the l1_ls
solver [9] or the alternating direction method of multipliers, while (3) uses the
algorithm presented in [2].
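To make the vector framework concrete, the following sketch codes a single test sample with the $q = p = 2$ instance of (6) in its penalized form (4), which has a closed-form solution, and then applies the residual rule (2). The function name `classify_vector_rls`, the `labels` array and the toy data are illustrative assumptions and are not taken from [1] or [2].

```python
import numpy as np

def classify_vector_rls(A, labels, y, lam=0.1):
    """Code y over A with min ||y - A x||_2^2 + lam ||x||_2^2, then apply rule (2)."""
    d = A.shape[1]
    # Closed-form ridge solution of the q = p = 2 case of model (6).
    x = np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ y)
    classes = np.unique(labels)
    # Class-wise residual: keep only the coefficients that belong to class c.
    errors = [np.linalg.norm(y - A[:, labels == c] @ x[labels == c]) for c in classes]
    return classes[int(np.argmin(errors))]

# Toy data (assumed): two classes living in different coordinate subspaces.
rng = np.random.default_rng(0)
A = np.zeros((6, 8))
A[:3, :4] = rng.normal(size=(3, 4))      # class 0 atoms
A[3:, 4:] = rng.normal(size=(3, 4))      # class 1 atoms
labels = np.array([0] * 4 + [1] * 4)
y = A[:, 5] + 0.01 * rng.normal(size=6)  # noisy copy of a class-1 atom
predicted = classify_vector_rls(A, labels, y)
```

Each test sample needs its own solve and its own residual loop here, which is exactly the one-by-one behaviour the bullets above describe.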

To overcome the weaknesses of (6) and make sufficient use of the collective
relationship among the given set of images, we propose to represent all the test
samples jointly over the training sample base. Here we employ a matrix instead of
a vector as the coding variable to evaluate the distribution of the feature space.
This idea induces a joint representation based classification (JRC) for collective
face recognition and reduces it to an $\ell_{2,q}$-$\ell_{2,p}$-minimization. To
solve the derived optimization problem, a unified algorithm is designed and its
convergence behavior is analyzed. Experiments on three public face datasets
validate the improvement of JRC over the state of the art.

This paper is organized as follows. In the second section, a joint representation
based classification (JRC) is established. The third section is dedicated to a
unified algorithm for solving the optimization problem induced by JRC. Some
computational details are considered in the fourth section, where an improved
practical algorithm is proposed. The experimental results are reported in the last
section.

2 Joint Representation Classification for Collective Face 
Recognition 

2.1 Joint Representation Model 

Suppose that we have $I$ classes of subjects in the dataset. $A_i \in
\mathbb{R}^{m \times d_i}$ ($1 \le i \le I$) denotes the $i$-th class, and each
column of $A_i$ is a sample of class $i$. Hence all the training samples are
aligned as $A = [A_1, A_2, \cdots, A_I] \in \mathbb{R}^{m \times d}$, where
$d = \sum_{i=1}^{I} d_i$. Given a collection of query images $y_1, y_2, \cdots,
y_n \in \mathbb{R}^m$, model (6) codes each $y_j$ ($1 \le j \le n$) by the
training samples $A$ as

$$y_j \approx A x_j, \tag{7}$$

where $x_j \in \mathbb{R}^d$ is the coding vector associated with $y_j$. If $y_j$
is from the $i$-th class, then $A_i$ is the most compact representation dictionary
and the optimal solution $x_j^*$ to (6) can be used for classification. Obviously,
coding pattern (7) depends on the single test sample $y_j$ individually for
classification but takes no account of the correlation with the other samples
$y_l$ ($l \ne j$). Even though different frontal faces take on different
appearances, they share similar features such as two eyes and brows in the upper
face and the nose and mouth in the lower face. The difference and similarity of
multiple face pictures form a unitary feature of the given set of images, which
plays an important role in collective face recognition.

Denoting by $Y = [y_1, y_2, \cdots, y_n] \in \mathbb{R}^{m \times n}$ all the
query images, we propose to represent the image set jointly by

$$Y \approx AX, \tag{8}$$

where $X = [x_1, x_2, \cdots, x_n] \in \mathbb{R}^{d \times n}$ stands for the
collective coding matrix. As far as the columns are concerned, system (8) is an
easy consequence of (7). To measure the fidelity of the joint coding system (8),
we consider $X$ in another sense. Let $A^i \in \mathbb{R}^d$ and $Y^i \in
\mathbb{R}^n$ be the $i$-th ($i = 1, 2, \cdots, m$) row vectors of the matrices
$A$ and $Y$ respectively; formula (8) is then equivalent to

$$X^T (A^i)^T \approx (Y^i)^T \quad \text{for } i = 1, 2, \cdots, m. \tag{9}$$

Note that $A$ and $Y$ arrange the sampled images column by column, hence their
rows span the feature space. From the viewpoint of feature extraction, the
collective coding matrix $X$ also projects the training feature space to
approximate the testing feature space. Traditional least squares regression aims
to minimize the error

$$\min_{X} \sum_{i=1}^{m} \|X^T (A^i)^T - (Y^i)^T\|_2^2 \quad \text{or} \quad \min_{X} \sum_{i=1}^{m} \|A^i X - Y^i\|_2^2. \tag{10}$$

Actually (10) can be easily reformulated as

$$\min_{X} \sum_{i=1}^{m} \|(AX - Y)^i\|_2^2, \tag{11}$$

where $(AX - Y)^i$ is the $i$-th row vector of $AX - Y$. In particular, when
$AX - Y$ has a single column, formula (11) reduces to the fidelity function of
(4). We therefore prefer a uniform generalization of (4) and (5) in the sense

$$\sum_{i=1}^{m} \|(AX - Y)^i\|_2^q, \quad (1 \le q \le 2). \tag{12}$$

Under the assumption that the joint representation and the feature distribution
share a similar pattern for all testing face images, we use the following
regularization

$$\sum_{i=1}^{d} \|X^i\|_2^p, \quad (0 < p \le 2), \tag{13}$$

where $X^i$ is the $i$-th row vector of $X$ for $i = 1, 2, \cdots, d$. Combining
(12) and (13), we present the joint representation model for classification as
follows

$$\min_{X} \sum_{i=1}^{m} \|(AX - Y)^i\|_2^q + \lambda \sum_{i=1}^{d} \|X^i\|_2^p, \quad (1 \le q \le 2, \ 0 < p \le 2). \tag{14}$$

When $Y$ has a single column, model (14) reduces to the coding vector version (6).
Compared with the coding vector $x$, the joint coding matrix $X$ unites sample
representation with feature projection, which to some extent reflects the integral
structure of the dataset. Hence (14) is a general extension of (3)-(6). To
simplify the formulation, we introduce the mixed matrix norm $\ell_{2,p}$
($p > 0$) (taking $\|X\|_{2,p}$ for example)

$$\|X\|_{2,p} = \Big(\sum_{i=1}^{d} \|X^i\|_2^p\Big)^{1/p}, \quad X \in \mathbb{R}^{d \times n}, \tag{15}$$

where $X^i$ denotes the $i$-th row of $X$. Then (14) is rewritten as

$$\min_{X} \|AX - Y\|_{2,q}^q + \lambda \|X\|_{2,p}^p, \quad (1 \le q \le 2, \ 0 < p \le 2). \tag{16}$$

In particular, when $p \in (0,1)$, $\ell_{2,p}$ is not a valid matrix norm because
it does not satisfy the triangle inequality. Moreover, the fractional matrix norm
based minimization (16) is neither convex nor Lipschitz continuous, which brings
computational challenges. Designing an efficient algorithm for such
$\ell_{2,q}$-$\ell_{2,p}$-minimizations is very important. It is also the most
challenging task in this paper.
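The mixed norm (15) and the objective (16) translate directly into a few lines of array code. The sketch below is only an illustration; the helper names `mixed_norm_pow` and `jrc_objective` and the small matrix `X_demo` are assumptions introduced here.

```python
import numpy as np

def mixed_norm_pow(X, p):
    """Return ||X||_{2,p}^p: the sum of the p-th powers of the row-wise l2 norms."""
    return float(np.sum(np.linalg.norm(X, axis=1) ** p))

def jrc_objective(A, X, Y, lam, q, p):
    """Objective of (16): ||AX - Y||_{2,q}^q + lam * ||X||_{2,p}^p."""
    return mixed_norm_pow(A @ X - Y, q) + lam * mixed_norm_pow(X, p)

# Rows with l2 norms 5, 0 and 1 (assumed toy matrix).
X_demo = np.array([[3.0, 4.0], [0.0, 0.0], [1.0, 0.0]])
frobenius_sq = mixed_norm_pow(X_demo, 2)   # 25 + 0 + 1
row_sparse = mixed_norm_pow(X_demo, 1)     # 5 + 0 + 1
```

For $p = 2$ the quantity is the squared Frobenius norm, while smaller $p$ penalizes the number of nonzero rows more strongly relative to their size, which is what drives the shared row-sparsity pattern across test samples.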


2.2 Joint Representation Based Classification 

For fixed parameters $q$ and $p$, suppose that $X^*$ is a minimizer of
optimization problem (16), that is

$$X^* = \arg\min_{X} \|AX - Y\|_{2,q}^q + \lambda \|X\|_{2,p}^p. \tag{17}$$

Let $X^*$ be partitioned into $I$ blocks as follows

$$X^* = \begin{bmatrix} X_1^* \\ X_2^* \\ \vdots \\ X_I^* \end{bmatrix}, \tag{18}$$

where $X_i^* \in \mathbb{R}^{d_i \times n}$ ($1 \le i \le I$). Let $\hat{X}_i^*$
denote the coding matrix associated with class $i$, that is

$$\hat{X}_i^* = \begin{bmatrix} 0 \\ \vdots \\ X_i^* \\ \vdots \\ 0 \end{bmatrix}, \tag{19}$$

then $A\hat{X}_i^* = A_i X_i^*$ ($1 \le i \le I$). For each testing image $y_j$
($j = 1, 2, \cdots, n$), we classify $y_j$ to the class with the most compact
representation. By evaluating the error corresponding to each class

$$\|(Y - A\hat{X}_i^*)_j\|_2, \quad i = 1, 2, \cdots, I, \tag{20}$$

where $(\cdot)_j$ denotes the $j$-th column, we pick out the index with the least
error. The joint representation based classification for face recognition can be
summarized as follows.

Algorithm 2.1. (JRC scheme for FR)

1. Start: Given $A \in \mathbb{R}^{m \times d}$, $Y \in \mathbb{R}^{m \times n}$
and select parameters $\lambda > 0$, $q \in [1,2]$ and $p \in (0,2]$.

2. Solve the $\ell_{2,q}$-$\ell_{2,p}$-minimization problem (16) for the coding
matrix $X^*$.

3. For j = 1 : n

      For i = 1 : I

         $e_i(y_j) = \|(Y - A_i X_i^*)_j\|_2$

      end

      Identity($y_j$) = $\arg\min_{1 \le i \le I} \{e_i(y_j)\}$

   end
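Step 3 of Algorithm 2.1 can be vectorized over all test columns at once. The following is a minimal sketch under the assumption that a coding matrix is already available from some solver; the function name `jrc_classify`, the `labels` array (one class label per column of $A$) and the toy matrices are hypothetical.

```python
import numpy as np

def jrc_classify(A, labels, X, Y):
    """Assign each column of Y to the class with the smallest residual e_i(y_j)."""
    classes = np.unique(labels)
    errors = np.empty((classes.size, Y.shape[1]))
    for row, c in enumerate(classes):
        mask = labels == c
        # Column-wise norms of Y - A_i X_i give e_i(y_j) for every j at once.
        errors[row] = np.linalg.norm(Y - A[:, mask] @ X[mask, :], axis=0)
    return classes[np.argmin(errors, axis=0)]

# Toy example (assumed): identity dictionary, two atoms per class.
A_toy = np.eye(4)
labels_toy = np.array([0, 0, 1, 1])
X_toy = np.array([[1.0, 0.0], [0.5, 0.0], [0.0, 2.0], [0.0, 1.0]])
Y_toy = A_toy @ X_toy
pred = jrc_classify(A_toy, labels_toy, X_toy, Y_toy)
```

Because the products $A_i X_i^*$ cover all $n$ test images in one matrix multiplication, the classification cost grows with the number of classes rather than with the number of class and image pairs handled individually.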

When $n = 1$, the observation $Y$ contains only a single testing sample and JRC
reduces to vector representation based classification. Furthermore, SRC, CRC-RLS
and the $\ell_1$-norm fidelity model (5) are the special cases of JRC with
$q = 2$ & $p = 1$, $q = p = 2$ and $q = p = 1$ respectively. In short, the main
contributions of JRC are:

1. JRC implements collective face representation simultaneously. This scheme is
more economical and efficient in computational cost and CPU time. Moreover, JRC
can handle image set based face recognition, which broadens the applications of
vector representation based classifications.

2. The joint coding technique fuses the difference of each testing sample
representation and the similarity hidden in the feature space of multiple face
images. For example, when $0 < p \le 1$ all query images are jointly represented
by the training samples with a similarly sparse feature distribution.

3. In the next section, a uniform algorithm will be developed to solve the
optimization problem (16) for any $q \in [1,2]$ and $p \in (0,2]$. The algorithm is
strictly decreasing until it converges to the optimal solution of problem (16). To
the best of our knowledge, it is an innovative approach to solving such a
generalized $\ell_{2,q}$-$\ell_{2,p}$-minimization.

It is worth pointing out that the JRC scheme can be easily extended to handle
pixel distortion, occlusion or high noise in test images. Modify (8) as

$$Y = AX + E, \tag{21}$$

where $E \in \mathbb{R}^{m \times n}$ is an error matrix. The nonzero entries of
$E$ locate the corruption or occlusion in $Y$. Substituting $\tilde{A} = [A, I]
\in \mathbb{R}^{m \times (d+m)}$ (here $I$ is the $m \times m$ identity matrix) and
$\tilde{X} = \begin{bmatrix} X \\ E \end{bmatrix} \in \mathbb{R}^{(d+m) \times n}$
for $A$ and $X$ respectively, a stable joint coding model can be formulated as

$$\min_{\tilde{X}} \|\tilde{A}\tilde{X} - Y\|_{2,q}^q + \lambda \|\tilde{X}\|_{2,p}^p, \quad (1 \le q \le 2, \ 0 < p \le 2). \tag{22}$$

Once a solution $\tilde{X}^* = \begin{bmatrix} X^* \\ E^* \end{bmatrix}$ to (22) is
computed, setting $Y^* = Y - E^*$ recovers a clean image from the corrupted
subject. To identify the testing sample $y_j$, we slightly modify the error of
$y_j$ with each subject to $e_i(y_j) = \|(Y - E^* - A_i X_i^*)_j\|_2$. Thus a
robust JRC is an easy consequence of Algorithm 2.1. The corresponding algorithm
and theoretical analysis can be demonstrated similarly. This paper will not
concentrate on this subject.
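The bookkeeping behind (21) and (22) is small: stack an identity block next to the dictionary, solve, and split the result. The sketch below shows only that bookkeeping; the helper names are hypothetical, and the $q = p = 2$ closed form is used purely as a placeholder solver (an assumption), so the resulting error matrix is not sparse as it would be for smaller $p$.

```python
import numpy as np

def extend_dictionary(A):
    """Return [A, I_m], the extended dictionary of model (22)."""
    return np.hstack([A, np.eye(A.shape[0])])

def split_solution(X_tilde, d):
    """Split a solution of (22) into the coding matrix X and the error matrix E."""
    return X_tilde[:d, :], X_tilde[d:, :]

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 3))
Y = rng.normal(size=(5, 2))
A_tilde = extend_dictionary(A)

# Placeholder solver (assumption): closed form of the q = p = 2 case.
lam = 0.5
X_tilde = np.linalg.solve(
    A_tilde.T @ A_tilde + lam * np.eye(A_tilde.shape[1]), A_tilde.T @ Y
)
X, E = split_solution(X_tilde, A.shape[1])
Y_clean = Y - E  # recovered images Y* = Y - E*
```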



3 An Iterative Quadratic Method for JRC 

Obviously, efficiently solving optimization problem (16) plays the most important
role in Algorithm 2.1. Although the models (1), (3) and (5) mentioned above are
special cases of (16), the algorithms used in [1-3] to solve those special
problems cannot be directly extended. General mixed matrix norm based
minimizations such as (16) have been widely used in machine learning.
Rakotomamonjy and his co-authors [10] proposed to use the mixed matrix norm
$\ell_{q,p}$ ($1 \le q \le 2$, $0 < p \le 1$) in multi-kernel and multi-task
learning. But the induced optimization problems in [10] have to be solved
separately by different algorithms for $p = 1$ and $0 < p < 1$. For grouped feature
selection, Suvrit [11] addressed a fast projection technique onto
$\ell_{1,p}$-norm balls, particularly for $p = 2, \infty$. But the method derived
in [11] does not match model (16). A similar joint sparse representation has been
used for robust multimodal biometrics recognition in [12]. The authors of [12]
employed the traditional alternating direction method of multipliers to solve the
involved optimization problem. Nie et al. [16] applied the $\ell_{2,0^+}$-norm to
semi-supervised robust dictionary learning, but the optimization algorithm was not
accompanied by a definite convergence analysis.

In this section, a unified method is developed to solve the
$\ell_{2,q}$-$\ell_{2,p}$-minimization (16) for any $1 \le q \le 2$ and
$0 < p \le 2$. In particular, when $p \in (0,1)$, (16) is neither convex nor
Lipschitz continuous, which results in considerable computational difficulties.
Motivated by the algorithm in [17] for solving $\ell_{2,p}$ ($0 < p \le 1$)-based
minimization, we design an iterative quadratic algorithm for such
$\ell_{2,q}$-$\ell_{2,p}$-minimization. Moreover, the convergence analysis is
carried out in a uniform way.


3.1 An Iteratively Quadratic Method 

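After a simple transformation, the definition (15) of $\|X\|_{2,p}^p$ can be
rewritten as

$$\|X\|_{2,p}^p = Tr(X^T H X), \tag{23}$$

where

$$H = \begin{cases} \mathrm{diag}\Big\{ \dfrac{1}{\|X^1\|_2^{2-p}}, \dfrac{1}{\|X^2\|_2^{2-p}}, \cdots, \dfrac{1}{\|X^d\|_2^{2-p}} \Big\}, & p \in (0,2); \\ I_d, & p = 2, \end{cases} \tag{24}$$

and $Tr(\cdot)$ stands for the trace operation. If we denote

$$G = \begin{cases} \mathrm{diag}\Big\{ \dfrac{1}{\|(AX-Y)^1\|_2^{2-q}}, \dfrac{1}{\|(AX-Y)^2\|_2^{2-q}}, \cdots, \dfrac{1}{\|(AX-Y)^m\|_2^{2-q}} \Big\}, & q \in [1,2); \\ I_m, & q = 2, \end{cases} \tag{25}$$

the objective function of (16) can be reformulated as

$$J(X) := Tr((AX - Y)^T G (AX - Y)) + \lambda \, Tr(X^T H X). \tag{26}$$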

Hence the KKT point of the unconstrained optimization problem (16) is also a
stationary point of $J(X)$,

$$\frac{\partial J(X)}{\partial X} = q A^T G (AX - Y) + \lambda p H X = 0, \tag{27}$$

so solving (16) reduces to finding a solution of equation (27). If
$A^T G A + \lambda \frac{p}{q} H$ is invertible, equation (27) is equivalent to

$$X = \Big(A^T G A + \lambda \frac{p}{q} H\Big)^{-1} A^T G Y. \tag{28}$$

To find an iterative solution of system (28), let us consider a closely related
optimization problem

$$\min_{X} \tilde{J}(X) := Tr((AX - Y)^T G (AX - Y)) + \lambda \frac{p}{q} Tr(X^T H X). \tag{29}$$

$\tilde{J}(X)$ is almost equivalent to $J(X)$ apart from the scaling factor
$\frac{p}{q}$ in the regularization parameter. If an iterative approximate
solution $X_k$ to (29) has been generated, $G_k$ and $H_k$ can be derived from
$X_k$ as in definitions (24) and (25). Then we can compute the next iterate
$X_{k+1}$ by solving the following subproblem

$$\min_{X} Tr((AX - Y)^T G_k (AX - Y)) + \lambda \frac{p}{q} Tr(X^T H_k X). \tag{30}$$

Actually, (30) is a scaled quadratic approximation to $J(X)$ at the iterate
$X_k$. Let $M_k = A^T G_k A + \lambda \frac{p}{q} H_k$. Since $G_k$ and $H_k$ are
usually symmetric and positive definite, problem (30) is equivalent to the
following quadratic optimization problem

$$\min_{X} Q_k(X) := \frac{1}{2} Tr(X^T M_k X) - Tr(Y^T G_k A X). \tag{31}$$

The minimizer of $Q_k(X)$ is also the solution of the linear system

$$M_k X = A^T G_k Y. \tag{32}$$

Based on the analysis and equations (23)-(32), the mixed
$\ell_{2,q}$-$\ell_{2,p}$ ($1 \le q \le 2$, $0 < p \le 2$) norm based optimization
problem (16) can be solved iteratively by a sequence of quadratic approximate
subproblems. Hence we name this approach the iterative quadratic method (IQM). It
is summarized as follows.



Algorithm 3.1. (IQM for Solving Problem (29))

1. Start: Given $A \in \mathbb{R}^{m \times d}$, $Y \in \mathbb{R}^{m \times n}$
and select parameters $\lambda > 0$, $q \in [1,2]$ and $p \in (0,2]$.

2. Set $k = 1$ and initialize $X_1 \in \mathbb{R}^{d \times n}$.

3. For $k = 1, 2, \cdots$ until convergence do:

      $H_k = \mathrm{diag}\{1 / \|X_k^i\|_2^{2-p}\}_{i=1}^{d}$ ($0 < p < 2$) or $H_k = I_d$ ($p = 2$);

      $C_k = -Y$;

      For i = 1 : I

         $B_i = A_i (X_k)_i$;

         $C_k = B_i + C_k$;

      end

      $G_k = \mathrm{diag}\{1 / \|C_k^i\|_2^{2-q}\}_{i=1}^{m}$ ($1 \le q < 2$) or $G_k = I_m$ ($q = 2$);

      $X_{k+1} = M_k^{-1} A^T G_k Y$.

Note that each iteration of Algorithm 3.1 has to compute the inverse of $M_k$,
which is expensive and can be numerically unstable. Here we suggest employing the
Moore-Penrose generalized inverse of $M_k$ to update $X_{k+1}$. Moreover, the main
computation $A_i X_i^*$ for classification is a by-product of $B_i$ in computing
the approximate solution $X^*$. Hence identifying test images requires only minor
extra calculations.

Algorithm 3.1 is a unified method for solving
$\ell_{2,q}$-$\ell_{2,p}$-minimizations for $q \in [1,2]$ and $p \in (0,2]$. This
approach provides algorithmic support for adaptively choosing a better fidelity
measurement and regularization in various applications. In particular, IQM
provides a uniform algorithm for solving the existing representation based models:
sparse representation ($q = 2$, $p = 1$), collaborative representation
($q = p = 2$) and $\ell_1$-norm face recognition ($q = p = 1$).
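The iteration of Algorithm 3.1 can be sketched compactly in NumPy. This is a minimal illustration rather than the implementation used for the experiments: the function name `iqm`, the ridge initialization, the fixed iteration count and the floor `delta` on the row norms (which keeps the weights finite) are all assumptions made here, and a direct linear solve stands in for the generalized inverse.

```python
import numpy as np

def iqm(A, Y, lam=0.1, q=1.5, p=1.0, iters=50, delta=1e-8):
    """Sketch of IQM: reweight rows, then solve M_k X = A^T G_k Y."""
    d = A.shape[1]
    # Assumed initial point X_1: the ridge (q = p = 2) solution.
    X = np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ Y)
    history = []
    for _ in range(iters):
        row_x = np.maximum(np.linalg.norm(X, axis=1), delta)
        row_r = np.maximum(np.linalg.norm(A @ X - Y, axis=1), delta)
        history.append(np.sum(row_r ** q) + lam * np.sum(row_x ** p))  # J(X_k)
        h = row_x ** (p - 2.0)   # diagonal of H_k (all ones when p = 2)
        g = row_r ** (q - 2.0)   # diagonal of G_k (all ones when q = 2)
        AtG = A.T * g            # A^T G_k without forming the diagonal matrix
        M = AtG @ A + (lam * p / q) * np.diag(h)
        X = np.linalg.solve(M, AtG @ Y)
    return X, history

# Toy data (assumed) to exercise the iteration.
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 10))
Y = rng.normal(size=(20, 4))
X_hat, history = iqm(A, Y, lam=0.1, q=1.5, p=1.0)
```

Recording $J(X_k)$ at every step gives a direct way to observe the descent behaviour analyzed in the next subsection, and with $q = p = 2$ the loop reproduces the regularized least squares solution after a single step.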

3.2 Convergence Analysis of IQM 

In this part, we demonstrate the theoretical convergence of Algorithm 3.1. The key
point is that the objective function $J(X)$ strictly decreases with respect to the
iterations until the matrix sequence $\{X_k\}$ converges to a stationary point of
$J(X)$.

Lemma 3.1. Let $\varphi(t) = t - a t^{1/a}$, where $a \in (0,1)$. Then for any
$t > 0$, $\varphi(t) \le 1 - a$, and $t = 1$ is the unique maximizer.

Proof. Taking the derivative of $\varphi(t)$ and setting it to zero, that is

$$\varphi'(t) = 1 - t^{\frac{1}{a} - 1} = 0,$$

we see that $\varphi'(t) = 0$ has the unique solution $t = 1$ for any
$a \in (0,1)$, which is just the maximizer of $\varphi(t)$ in $(0, +\infty)$. □
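The proof identifies the stationary point but does not state why it is a maximizer. The following short computation, added as a supporting sketch, fills in that step using only the definition of $\varphi$ in Lemma 3.1.

```latex
\varphi''(t) = -\Big(\frac{1}{a} - 1\Big)\, t^{\frac{1}{a} - 2} < 0
\quad \text{for all } t > 0,\ a \in (0,1),
\qquad\text{and}\qquad
\varphi(1) = 1 - a .
```

Since $\varphi$ is strictly concave on $(0, +\infty)$, its unique stationary point $t = 1$ is the unique global maximizer, which gives $\varphi(t) \le \varphi(1) = 1 - a$.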
Lemma 3.2. Given $X_k$ and $X_{k+1}$ in $\mathbb{R}^{d \times n}$, the following
inequalities hold,

$$\|AX_{k+1} - Y\|_{2,q}^q - \frac{q}{2} \sum_{i=1}^{m} \frac{\|(AX_{k+1} - Y)^i\|_2^2}{\|(AX_k - Y)^i\|_2^{2-q}} \le \Big(1 - \frac{q}{2}\Big) \|AX_k - Y\|_{2,q}^q \tag{33}$$

and

$$\|X_{k+1}\|_{2,p}^p - \frac{p}{2} \sum_{i=1}^{d} \frac{\|X_{k+1}^i\|_2^2}{\|X_k^i\|_2^{2-p}} \le \Big(1 - \frac{p}{2}\Big) \|X_k\|_{2,p}^p \tag{34}$$

for any $q \in [1,2)$ and $p \in (0,2)$. Moreover, the equalities in (33) and (34)
hold if and only if $\|(AX_{k+1} - Y)^i\|_2 = \|(AX_k - Y)^i\|_2$ for
$i = 1, 2, \cdots, m$ and $\|X_{k+1}^i\|_2 = \|X_k^i\|_2$ for
$i = 1, 2, \cdots, d$.


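Proof. Substituting $t_1 = \dfrac{\|(AX_{k+1} - Y)^i\|_2^q}{\|(AX_k - Y)^i\|_2^q}$
and setting $a_1 = \frac{q}{2}$ in Lemma 3.1, we obtain

$$\frac{\|(AX_{k+1} - Y)^i\|_2^q}{\|(AX_k - Y)^i\|_2^q} - \frac{q}{2} \frac{\|(AX_{k+1} - Y)^i\|_2^2}{\|(AX_k - Y)^i\|_2^2} \le 1 - \frac{q}{2}. \tag{35}$$

Similarly, taking $t_2 = \dfrac{\|X_{k+1}^i\|_2^p}{\|X_k^i\|_2^p}$ and
$a_2 = \frac{p}{2}$ in Lemma 3.1, we have

$$\frac{\|X_{k+1}^i\|_2^p}{\|X_k^i\|_2^p} - \frac{p}{2} \frac{\|X_{k+1}^i\|_2^2}{\|X_k^i\|_2^2} \le 1 - \frac{p}{2}. \tag{36}$$

Multiplying (35) and (36) by $\|(AX_k - Y)^i\|_2^q$ and $\|X_k^i\|_2^p$
respectively, we have the following inequalities simultaneously

$$\|(AX_{k+1} - Y)^i\|_2^q - \frac{q}{2} \frac{\|(AX_{k+1} - Y)^i\|_2^2}{\|(AX_k - Y)^i\|_2^{2-q}} \le \Big(1 - \frac{q}{2}\Big) \|(AX_k - Y)^i\|_2^q \tag{37}$$

for $i = 1, 2, \cdots, m$, and

$$\|X_{k+1}^i\|_2^p - \frac{p}{2} \frac{\|X_{k+1}^i\|_2^2}{\|X_k^i\|_2^{2-p}} \le \Big(1 - \frac{p}{2}\Big) \|X_k^i\|_2^p, \quad i = 1, 2, \cdots, d. \tag{38}$$

Summing over $i$ in (37) and (38), we derive (33) and (34).

Based on Lemma 3.1, $t_1 = 1$ and $t_2 = 1$ are the unique maximizers of
$\varphi(t)$ in $(0, +\infty)$ when $a_1 = \frac{q}{2}$ and $a_2 = \frac{p}{2}$
respectively. Namely, $\|(AX_{k+1} - Y)^i\|_2 = \|(AX_k - Y)^i\|_2$ and
$\|X_{k+1}^i\|_2 = \|X_k^i\|_2$ are necessary and sufficient for the equalities to
hold in (37) and (38) respectively. □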
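Inequality (34) holds for any pair of matrices with nonzero rows, so it can be checked numerically on random data. The sketch below is only a sanity check and not part of the proof; the function name `lemma_gap` and the random test matrices are assumptions.

```python
import numpy as np

def lemma_gap(X_old, X_new, p):
    """Right-hand side minus left-hand side of (34); non-negative by Lemma 3.2."""
    old = np.linalg.norm(X_old, axis=1)
    new = np.linalg.norm(X_new, axis=1)
    lhs = np.sum(new ** p) - 0.5 * p * np.sum(new ** 2 / old ** (2 - p))
    rhs = (1 - 0.5 * p) * np.sum(old ** p)
    return rhs - lhs

rng = np.random.default_rng(3)
gaps = [
    lemma_gap(rng.normal(size=(6, 3)), rng.normal(size=(6, 3)), p)
    for p in (0.3, 0.5, 1.0, 1.5)
    for _ in range(100)
]
X_same = rng.normal(size=(6, 3))
gap_equal = lemma_gap(X_same, X_same, 1.0)  # equality case: identical row norms
```

The same check applies to (33) by passing the residual matrices $AX_k - Y$ and $AX_{k+1} - Y$ with $q$ in place of $p$.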


Remark 3.1. Inequalities (33) and (34) are established independently of Algorithm
3.1. They express innate properties of the mixed matrix norms $\ell_{2,q}$ and
$\ell_{2,p}$ for $q \in [1,2)$ and $p \in (0,2)$.

Theorem 3.1. Suppose that $\{X_k\}$ is the matrix sequence generated by Algorithm
3.1. Then $J(X_k)$ strictly decreases with respect to $k$ for any
$1 \le q \le 2$ and $0 < p \le 2$ until $\{X_k\}$ converges to a stationary point
of $J(X)$.

Proof. By the procedure of Algorithm 3.1, $X_{k+1}$ is the solution of the linear
system (32), and hence the optimal matrix of problems (30) and (31). Thus we have

$$Q_k(X_{k+1}) \le Q_k(X_k). \tag{39}$$

For $q \in [1,2)$ and $p \in (0,2)$, (39) is equivalent to

$$q \sum_{i=1}^{m} \frac{\|(AX_{k+1} - Y)^i\|_2^2}{\|(AX_k - Y)^i\|_2^{2-q}} + \lambda p \sum_{i=1}^{d} \frac{\|X_{k+1}^i\|_2^2}{\|X_k^i\|_2^{2-p}} \le q \|AX_k - Y\|_{2,q}^q + \lambda p \|X_k\|_{2,p}^p. \tag{40}$$

Note that $J(X_k) = \|AX_k - Y\|_{2,q}^q + \lambda \|X_k\|_{2,p}^p$. Adding
inequality (33) and $\lambda$ times inequality (34), we derive

$$J(X_{k+1}) - \Big( \frac{q}{2} \sum_{i=1}^{m} \frac{\|(AX_{k+1} - Y)^i\|_2^2}{\|(AX_k - Y)^i\|_2^{2-q}} + \frac{\lambda p}{2} \sum_{i=1}^{d} \frac{\|X_{k+1}^i\|_2^2}{\|X_k^i\|_2^{2-p}} \Big) \le J(X_k) - \Big( \frac{q}{2} \|AX_k - Y\|_{2,q}^q + \frac{\lambda p}{2} \|X_k\|_{2,p}^p \Big). \tag{41}$$

Adding one half of (40) to (41), $J(X_{k+1}) \le J(X_k)$ follows for
$q \in [1,2)$ and $p \in (0,2)$.

For $q = 2$ or $p = 2$, the inequality is much easier to derive. Taking $q = 2$ and
$p \in (0,2)$ for example, (39) reduces to

$$\|AX_{k+1} - Y\|_{2,2}^2 + \frac{\lambda p}{2} \sum_{i=1}^{d} \frac{\|X_{k+1}^i\|_2^2}{\|X_k^i\|_2^{2-p}} \le \|AX_k - Y\|_{2,2}^2 + \frac{\lambda p}{2} \|X_k\|_{2,p}^p. \tag{42}$$

Combining (42) with $\lambda$ times (34), we also obtain
$J(X_{k+1}) \le J(X_k)$. In the case of $q \in [1,2)$, $p = 2$ or $q = p = 2$,
$J(X_{k+1}) \le J(X_k)$ can be deduced analogously.

Once $J(X_{k+1}) = J(X_k)$ happens for some $k$, the equalities in (40) and (41)
(or (42)) hold. Hence the equalities in (33) and (34) are active. From Lemma 3.2,
we obtain $\|(AX_{k+1} - Y)^i\|_2 = \|(AX_k - Y)^i\|_2$ for
$i = 1, 2, \cdots, m$ and $\|X_{k+1}^i\|_2 = \|X_k^i\|_2$ for
$i = 1, 2, \cdots, d$. Thus $G_{k+1} = G_k$ and $H_{k+1} = H_k$, which implies that
$X_{k+1}$ is a solution to (28). □

The objective function sequence $\{J(X_k)\}$ is decreasing and bounded below.
Hence $\{J(X_k)\}$ eventually converges to some minimum of problem (16). The amount
of descent measures the convergence precision.

Remark 3.2. The stopping criterion of Algorithm 3.1 can be chosen as
$J(X_k) - J(X_{k+1}) \le \varepsilon$ or
$\rho_k := \frac{J(X_k) - J(X_{k+1})}{J(X_k)} \le \varepsilon$ for some required
precision $\varepsilon > 0$.

Theoretically, $X_k^i = 0$ or $C_k^i = 0$ may occur in some step $k$; then $H_k$
and $G_k$ cannot be properly updated in the non-Frobenius norm case ($0 < p < 2$
and $1 \le q < 2$). We deal with this by perturbing with $\delta > 0$ such that
$\{H_k\}_{ii} = \delta^{p-2} > 0$ and $\{G_k\}_{ii} = \delta^{q-2} > 0$. The
descent of $\{J(X_k)\}$ is relaxed to

$$J(X_{k+1}) \le J(X_k) + \Big(1 - \frac{p}{2}\Big) \delta^p \quad \text{or} \quad J(X_{k+1}) \le J(X_k) + \Big(1 - \frac{q}{2}\Big) \delta^q. \tag{43}$$

If the convergence precision $\varepsilon$ is chosen fairly larger than the
perturbation $\delta$ (i.e. $\varepsilon \gg \delta$), the perturbed $J(X_k)$ can
still be considered approximately decreasing. As a matter of fact, $X_k^i = 0$ and
$C_k^i = 0$ never happen in practical implementation.


4 Practical Implementation of JRC


In Algorithm 3.1, IQM has to update the matrix sequence by computing the inverse
matrix of $M_k$. This is expensive in practical implementation, especially for
large-scale problems. Reviewing the procedure of Algorithm 3.1, we notice that
$X_{k+1} = M_k^{-1} A^T G_k Y$ solves the $k$-th subproblem (31) exactly, which is
unnecessary. Observe that (31) is a positive definite quadratic subproblem. There
are many efficient algorithms to solve it approximately, such as the conjugate
gradient method, gradient methods with different stepsizes, etc. In this paper, we
choose the Barzilai and Borwein (BB) gradient method due to its simplicity and
efficiency. The BB gradient method was first presented in [18], and afterwards
extended and developed in many settings and applications [18-23]. When applied to
the quadratic matrix optimization subproblem (31), the Barzilai and Borwein
gradient method takes the form

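$$X_k^{(t+1)} = X_k^{(t)} - \alpha_k^{(t)} \nabla Q_k(X_k^{(t)}), \tag{44}$$

where the superscript $(t)$ denotes the $t$-th iteration for solving (31).
$\nabla Q_k(X_k^{(t)})$ is the gradient matrix of $Q_k(X)$ at $X_k^{(t)}$,

$$\nabla Q_k(X_k^{(t)}) = M_k X_k^{(t)} - A^T G_k Y. \tag{45}$$

The Barzilai and Borwein gradient method [18] chooses the stepsize
$\alpha_k^{(t)}$ such that $D^{(t)} = \alpha_k^{(t)} I$ has a certain quasi-Newton
property

$$D^{(t)} = \arg\min_{D = \alpha I} \|S_k^{(t-1)} - D \, T_k^{(t-1)}\|_F \tag{46}$$

or

$$D^{(t)} = \arg\min_{D = \alpha I} \|D^{-1} S_k^{(t-1)} - T_k^{(t-1)}\|_F, \tag{47}$$

where $\|\cdot\|_F$ denotes the Frobenius matrix norm and $S_k^{(t-1)}$,
$T_k^{(t-1)}$ are determined by the information obtained at the points
$X_k^{(t)}$ and $X_k^{(t-1)}$,

$$S_k^{(t-1)} := X_k^{(t)} - X_k^{(t-1)}, \qquad T_k^{(t-1)} := \nabla Q_k(X_k^{(t)}) - \nabla Q_k(X_k^{(t-1)}) = M_k S_k^{(t-1)}. \tag{48}$$

Solving (46) and (47) yields two BB stepsizes

$$\alpha_k^{(t)} = \frac{Tr((S_k^{(t-1)})^T T_k^{(t-1)})}{Tr((T_k^{(t-1)})^T T_k^{(t-1)})} \tag{49}$$

and

$$\alpha_k^{(t)} = \frac{Tr((S_k^{(t-1)})^T S_k^{(t-1)})}{Tr((S_k^{(t-1)})^T M_k S_k^{(t-1)})}. \tag{50}$$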

Compared with the classical steepest descent method, the BB gradient method often
needs less computation and converges more rapidly [24]. For optimization problems
in more than two dimensions, the BB method has theoretical difficulties due to its
heavily non-monotone behavior. But for strongly convex quadratic problems of any
dimension, the BB method converges at an R-linear rate [19, 21]. The BB method has
also been applied to matrix optimization problems [25] and exhibited desirable
performance. Based on equations (44)-(50), the last step in Algorithm 3.1,
$X_{k+1} = M_k^{-1} A^T G_k Y$, can be practically replaced by the BB gradient
method as the $k$-th inner loop.

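Algorithm 4.1. (BB Gradient Method for Solving Subproblem (31))

1. Start: given the inner loop stopping criterion $\varepsilon_2 > 0$.

2. Initialize $X_k^{(1)} = X_k$ and $\nabla Q_k^{(1)} = M_k X_k^{(1)} - A^T G_k Y$;

3. For $t = 1, 2, \cdots$ until $Tr((\nabla Q_k^{(t)})^T \nabla Q_k^{(t)}) \le \varepsilon_2$, output $X_{k+1} = X_k^{(t)}$, do:

      if t = 1

         $\alpha_k^{(t)} = \dfrac{Tr((\nabla Q_k^{(t)})^T \nabla Q_k^{(t)})}{Tr((\nabla Q_k^{(t)})^T M_k \nabla Q_k^{(t)})}$;

      else

         $S_k^{(t-1)} = X_k^{(t)} - X_k^{(t-1)}$;

         $T_k^{(t-1)} = \nabla Q_k^{(t)} - \nabla Q_k^{(t-1)}$;

         $\alpha_k^{(t)}$ is computed as (49) or (50);

      end

      $X_k^{(t+1)} = X_k^{(t)} - \alpha_k^{(t)} \nabla Q_k^{(t)}$;

      $\nabla Q_k^{(t+1)} = M_k X_k^{(t+1)} - A^T G_k Y$.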

In the $k$-th inner loop, Algorithm 4.1 uses two initial matrices. One is the
approximate solution $X_k$ of the last subproblem and the other is the Cauchy
point from $X_k$ [26]. The Cauchy stepsize $\alpha_k$ is the solution to the
one-dimensional optimization problem

$$\min_{\alpha > 0} Q_k(X_k - \alpha \nabla Q_k(X_k)), \tag{51}$$

and the Cauchy point is then $X_k - \alpha_k \nabla Q_k(X_k)$. If $M_k$ in
Algorithm 3.1 is guaranteed to be positive definite (if not, $H_k$ or $G_k$ can be
slightly perturbed), subproblem (31) is a strongly convex quadratic. The BB
gradient method with step length (49) or (50) then converges at an R-linear rate.
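The inner loop of Algorithm 4.1 can be sketched as follows for a generic symmetric positive definite system $M X = R$, with $R$ playing the role of $A^T G_k Y$. This is an illustrative sketch only: the function name `bb_solve`, the iteration cap and the toy system are assumptions, and stepsize (49) is used throughout after the initial Cauchy step.

```python
import numpy as np

def bb_solve(M, R, X0, eps=1e-10, max_iter=500):
    """Approximately solve M X = R (M symmetric positive definite) by BB gradient steps."""
    X = X0.copy()
    grad = M @ X - R                       # gradient (45)
    X_prev = grad_prev = None
    for _ in range(max_iter):
        if np.sum(grad * grad) <= eps:     # Tr(grad^T grad) <= eps
            break
        if X_prev is None:
            # First step: Cauchy (exact line search) stepsize.
            alpha = np.sum(grad * grad) / np.sum(grad * (M @ grad))
        else:
            S = X - X_prev
            T = grad - grad_prev
            alpha = np.sum(S * T) / np.sum(T * T)   # BB stepsize (49)
        X_prev, grad_prev = X, grad
        X = X - alpha * grad               # update (44)
        grad = M @ X - R
    return X

# Toy positive definite system (assumed).
rng = np.random.default_rng(2)
B = rng.normal(size=(8, 5))
M = B.T @ B + np.eye(5)
R = rng.normal(size=(5, 3))
X_bb = bb_solve(M, R, np.zeros((5, 3)))
```

Only matrix products with $M_k$ are needed, so no inverse or factorization is formed; the tolerance `eps` controls how inexactly each subproblem is solved.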

For simplicity, we name the IQM with the inexact Algorithm 4.1 the practical
iterative quadratic method (PIQM). We still denote by $\{X_k\}$ the approximate
matrix sequence generated by PIQM. The BB inner loop makes the objective function
value of subproblem (31) decline, that is $Q_k(X_{k+1}) \le Q_k(X_k)$. Then
$\{J(X_k)\}$ is always decreasing, which is sufficient and necessary for
$\{X_k\}$ to converge uniformly to the stationary point of problem (16). The
following conclusion can be easily derived.

Theorem 4.1. Denote by $X^*$ the output point generated by PIQM; then $X^*$ is an
approximate stationary point of $J(X)$. In particular, for $q, p \in [1,2]$, $X^*$
is an approximate global minimizer of optimization problem (16). When $p$ is
fractional, $X^*$ is one of the KKT points.

A practical version of the iterative quadratic method for joint classification in
face recognition can be summarized as follows.

Algorithm 4.2. (PIQM for JRC)

1. Start: load $A$, $Y$ and set $\lambda > 0$, $q \in [1,2]$, $p \in (0,2]$ and
precision levels $\varepsilon_1 > 0$, $\varepsilon_2 > 0$.

2. Employ PIQM to solve (16) and output an approximate coding matrix
$X^* := X_{k+1}$.

3. Classify $Y$ by $X^*$.

5 Experimental Results 

In this section, the joint representation based classification (JRC) with PIQM is
applied to face recognition. Three public data sets are used; a brief description
of each is given as follows.

AT&T database was formerly known as “the ORL database of faces”. It consists of 400
frontal images of 40 individuals. For each subject, 10 pictures were taken at
different times, with varying lighting conditions, multiple facial expressions,
adornments and rotations up to 20 degrees. All the images are aligned with
dimension 112 x 92. The database can be retrieved from
http://www.cl.cam.ac.uk/Research/DTG/attarchive:pub/data/att_faces.tar.Z as a
4.5Mbyte compressed tar file. Typical pictures can be seen in Figure 1.



Figure 1: Typical images of AT&T database


Georgia-Tech database contains 15 images each of 50 subjects. The images were taken
in two or three sessions at different times with different facial expressions,
scales and backgrounds. The average size of the faces in these images is 150 x 150
pixels. The Georgia Tech face database and the annotation can be found at
http://www.anefian.com/research/face_reco.htm. Typical pictures of four persons
are shown in Figure 2.



Figure 2: Typical images of Georgia-Tech database


Extended Yale B database consists of 2414 frontal-face images of 38 subjects. Each
subject has around 64 images. The images are cropped and normalized to 192 x 168
under various laboratory-controlled lighting conditions [27, 28]. Figure 3 displays
typical pictures of 4 subjects.


Figure 3: Typical images of Extended Yale B database


Extensive experiments are conducted for different image sizes and different
parameters. Four comparable schemes are implemented: JRC, SRC, CRC-RLS and the
traditional SVM classifier. JRC is carried out in practice via PIQM, while SRC is
solved by the l1_ls solver [9] and CRC-RLS employs the code from [2]. We realize
SVM with the software LIBSVM [30] with a linear kernel; the pseudo code can be
found at http://www.csie.ntu.edu.tw/~cjlin/libsvm/faq.html#f203. All the schemes
are implemented in Matlab R2014a (win32) on a typical PC with 4GiB memory and a
2.40GHz CPU.

Considering that JRC is a joint framework including SRC and CRC-RLS, we select 
six pairs of q, p in [1, 2] and (0, 2] respectively: 

q = p = 2 (corresponding to CRC-RLS), 

q = 2, p = 1 (corresponding to SRC),

and four other generalized cases

q = 1.5 & p = 1, q = 1.5 & p = 0.5,
q = 1 & p = 1, q = 1 & p = 0.5.

The parameter λ in (16) is varied from 0.01 to 10 by factors of 10, and the best
result is picked out. All the stopping precisions are set to 10^{-3}.
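
The parameter sweep described above can be sketched as follows. This is a minimal illustration and not the authors' Matlab code: `evaluate` is a hypothetical placeholder for one full train/test run of JRC that returns an accuracy, and the grid {0.01, 0.1, 1, 10} is our reading of how λ is varied.

```python
# Sketch of the parameter sweep: six (q, p) pairs, lambda on a decade grid.
# `evaluate` is a hypothetical stand-in for one JRC train/test run.

QP_PAIRS = [(2.0, 2.0), (2.0, 1.0), (1.5, 1.0), (1.5, 0.5), (1.0, 1.0), (1.0, 0.5)]
LAMBDAS = [0.01 * 10 ** k for k in range(4)]  # 0.01, 0.1, 1, 10 (up to rounding)
TOL = 1e-3  # stopping precision used for all solvers


def best_lambda(evaluate, q, p, lambdas=LAMBDAS):
    """Return (best_accuracy, lambda) for one (q, p) pair."""
    results = [(evaluate(q, p, lam), lam) for lam in lambdas]
    return max(results)  # ties are broken in favour of the larger lambda


def sweep(evaluate):
    """Run the sweep for every (q, p) pair and keep the best result of each."""
    return {(q, p): best_lambda(evaluate, q, p) for q, p in QP_PAIRS}
```

Keeping the pairs and the grid as plain lists makes it easy to extend the sweep, for instance to λ = 100 as used in Table 4.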

All the images are resized as in [1, 2]. For the AT&T database, the pictures are
downsampled to 11 x 10. The downsampling ratios for the Georgia-Tech database and
the Extended Yale B database are 1/8 and 1/16. For each subject, around 80% of the
pictures are randomly selected for training and the rest are used for testing. For
example, 8 pictures of each individual in the AT&T database are randomly picked for
training while the remaining 2 are used for testing. All the classification schemes
are applied directly to the images without any pre-processing. The recognition
accuracy and running time are reported in Tables 1-3.
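
The per-subject random split can be sketched as below. This is an illustrative sketch only: the function name, the fixed seed and the rounding rule for "around 80%" are our assumptions, and the example uses the standard AT&T layout of 40 subjects with 10 images each.

```python
import random


def split_per_subject(labels, train_fraction=0.8, seed=0):
    """Randomly split image indices into train/test sets, subject by subject.

    `labels[i]` is the subject id of image i.
    """
    rng = random.Random(seed)
    by_subject = {}
    for idx, subject in enumerate(labels):
        by_subject.setdefault(subject, []).append(idx)

    train, test = [], []
    for subject in sorted(by_subject):
        indices = by_subject[subject]
        rng.shuffle(indices)
        n_train = round(train_fraction * len(indices))
        train.extend(indices[:n_train])
        test.extend(indices[n_train:])
    return sorted(train), sorted(test)


# AT&T-like layout: 40 subjects with 10 images each.
labels = [s for s in range(40) for _ in range(10)]
train_idx, test_idx = split_per_subject(labels)
print(len(train_idx), len(test_idx))  # 320 80
```

Splitting within each subject, rather than over the whole image set, guarantees that every class appears in both the training and the testing set.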

Methods              Accuracy    CPU time
SRC                  98.75       67.2658
JRC(q=2,p=1)         97.5        0.1612
CRC-RLS              95          0.0872
JRC(q=p=2)           97.5        0.0073
SVM                  95          0.0667
JRC(q=1.5,p=1)       97.5        0.3867
JRC(q=1.5,p=0.5)     95          1.8756
JRC(q=p=1)           97.5        0.1994
JRC(q=1,p=0.5)       97.5        0.1640

Table 1: The recognition accuracy (%) and running time (second)
for AT&T database


                     Downsampling ratio 1/8    Downsampling ratio 1/16
Methods              Accuracy    Time          Accuracy    Time
SRC                  99.33       2843          97.33       3197
JRC(q=2,p=1)         99.33       2.41          97.33       1.07
CRC-RLS              98          1.95          96.67       0.66
JRC(q=p=2)           99.33       0.97          98.67       0.17
SVM                  96.67       5.09          96.67       1.46
JRC(q=1.5,p=1)       99.33       4.89          98.67       3.86
JRC(q=1.5,p=0.5)     99.33       4.89          98.67       3.89
JRC(q=p=1)           99.33       5.54          99.33       1.11
JRC(q=1,p=0.5)       99.33       4.79          99.33       1.09

Table 2: The recognition accuracy (%) and CPU time (second)
for Georgia-Tech database


                     Downsampling ratio 1/8    Downsampling ratio 1/16
Methods              Accuracy    Time          Accuracy    Time
SRC                  96.76       4828          96.36       668.53
JRC(q=2,p=1)         96.96       22.67         76.11       164.71
CRC-RLS              96.76       2.02          95.55       1.9
JRC(q=p=2)           96.96       0.75          91.29       0.34
SVM                  95.55       6.12          94.33       2.61
JRC(q=1.5,p=1)       96.96       22.04         87.05       22.03
JRC(q=1.5,p=0.5)     96.96       54.21         65.59       101.59
JRC(q=p=1)           96.96       27.08         90.49       20.51
JRC(q=1,p=0.5)       96.96       26.87         91.29       25.23

Table 3: The recognition accuracy (%) and CPU time (second)
for Extended Yale B database


Based on the experimental results on the three databases, we draw the following
conclusions:

• Jointly representing all the testing images simultaneously does accelerate face
recognition. On all the databases, JRC (q=p=2) is the fastest scheme. Its CPU time is
thousands of times less than that of SRC. For example, JRC (q=p=2) classifies 484
images in 0.17 second on the Georgia-Tech database with downsampling ratio 1/16, and
its accuracy rate is 98.67%, outperforming SRC (97.33%), CRC-RLS (96.67%) and SVM
(96.67%). More details can be found in Tables 1-3.

• JRC exhibits competitive performance in recognition accuracy. On the AT&T database,
the recognition rate of JRC is 97.5%, compared to 98.75% for SRC and 95% for CRC-RLS
and SVM. On the Georgia-Tech database, JRC achieves the best recognition rate (99.33%),
matching or exceeding the other classification schemes. On the Yale B database with
downsampling ratio 1/8, JRC also outperforms the other methods in accuracy.
Unfortunately, JRC does not remain the best at downsampling ratio 1/16. A possible
reason is that some pictures with strong lighting contrast (see Figure 3) aggravate
the noise for other images in joint coding.

• Different q ∈ [1, 2] and p ∈ (0, 2] for JRC reflect different feature patterns
underlying the image set. Taking JRC (q = 2, p = 1) as an example, the joint model
combines sparsity of representation with correlation among multiple images. The
representation coefficients reveal the joint effect of JRC (q = 2, p = 1); Figure 4
gives an example from the Yale B database. Compared to SRC, JRC (q = 2, p = 1)
produces a group-sparse pattern rather than an individually sparse one. Actually, the
other testing samples (12 pictures) of the same subject also have a similar group
representation pattern.



Figure 4: The recovered coefficients by JRC (q=2,p=1) and SRC

• The convergence behavior of PIQM for JRC is displayed in Figure 5. The x-axis
is the iteration number and the y-axis is the logarithm of ρ_k. PIQM converges within
40 steps on the three databases for all jointly sparse models (five pairs of q and p).
JRC (q=p=2) always converges in three iterations, hence its plot is omitted here.
Overall, PIQM provides a uniform algorithm for the varied JRC models with respect to
q ∈ [1, 2] and p ∈ (0, 2].



(a) (b) (c) 


Figure 5: (a) PIQM on AT&T (b) PIQM on Georgia-Tech (c) PIQM on Yale B

• From Tables 1-3, it is observed that CRC-RLS has a fairly good performance in
recognition accuracy and CPU time. But CRC-RLS is heavily sensitive to the
regularization parameter λ (see Table 4) because it has a smooth regularizer. By
comparison, JRC (q=p=2) is more stable owing to its joint technique. Multiple images
have a complementary effect for recognition, especially when the model is ill-posed.
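
The group-sparsity effect discussed for JRC (q = 2, p = 1) can be made concrete with a small numeric sketch. It assumes the row-wise mixed norm ||X||_{2,p}^p = sum_i ||x^i||_2^p, where x^i is the i-th row of the coefficient matrix X (one row per base image, one column per testing image); this definition is our assumption about the regularizer in (16), and the toy matrices are made up for illustration.

```python
import math


def mixed_norm_pow(X, p):
    """Return sum_i ||row_i||_2 ** p for a matrix given as a list of rows."""
    total = 0.0
    for row in X:
        row_norm = math.sqrt(sum(v * v for v in row))
        total += row_norm ** p  # zero rows contribute nothing for p > 0
    return total


# Two toy coefficient matrices (4 base images x 3 testing images) with the
# same entries up to position, hence the same Frobenius norm.
X_joint = [[1.0, 1.0, 1.0],   # all testing images use the same base image
           [0.0, 0.0, 0.0],
           [0.0, 0.0, 0.0],
           [0.0, 0.0, 0.0]]
X_single = [[1.0, 0.0, 0.0],  # each testing image uses a different base image
            [0.0, 1.0, 0.0],
            [0.0, 0.0, 1.0],
            [0.0, 0.0, 0.0]]

for p in (2.0, 1.0, 0.5):
    print(p, mixed_norm_pow(X_joint, p), mixed_norm_pow(X_single, p))
```

Under this assumed definition the two penalties coincide for p = 2, while for p = 1 and p = 0.5 the shared-row pattern is cheaper, which is consistent with p < 2 favoring coefficients grouped on the same base images.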


λ =           0.01     0.1      1        10       100
CRC-RLS       28.34    66.82    95       96.76    96.76
JRC(q=p=2)    96.96    96.96    96.96    96.96    96.96

Table 4: The recognition accuracy (%) for different λ on Extended Yale B database
with downsampling ratio 1/8
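
As a quick check of the figures quoted above, the following sketch recomputes the speed-up of JRC (q=p=2) over SRC from the CPU times in Tables 1-3 and the spread of accuracy over λ from Table 4. All numbers are copied from the tables; only the variable names are ours.

```python
# CPU times in seconds, copied from Tables 1-3: (SRC, JRC with q=p=2).
cpu_times = {
    "AT&T": (67.2658, 0.0073),
    "Georgia-Tech 1/8": (2843.0, 0.97),
    "Georgia-Tech 1/16": (3197.0, 0.17),
    "Extended Yale B 1/8": (4828.0, 0.75),
    "Extended Yale B 1/16": (668.53, 0.34),
}
speedups = {name: src / jrc for name, (src, jrc) in cpu_times.items()}

# Accuracy (%) for lambda = 0.01, 0.1, 1, 10, 100, copied from Table 4.
accuracy = {
    "CRC-RLS": [28.34, 66.82, 95.0, 96.76, 96.76],
    "JRC(q=p=2)": [96.96, 96.96, 96.96, 96.96, 96.96],
}
spread = {name: max(vals) - min(vals) for name, vals in accuracy.items()}

for name, factor in speedups.items():
    print(f"{name}: SRC / JRC(q=p=2) = {factor:.0f}x")
for name, gap in spread.items():
    print(f"{name}: accuracy spread over lambda = {gap:.2f} points")
```

Every speed-up factor exceeds one thousand, and the accuracy of CRC-RLS varies by roughly 68 points over the tested λ values while that of JRC (q=p=2) does not change.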


6 Conclusions 

In this paper, a joint representation classification for collective face recognition
is proposed. By aligning all the testing images into a matrix, joint representation
coding is reduced to a class of generalized matrix pseudo-norm based optimization
problems. A unified algorithm is developed to solve the mixed
l_{2,q}-l_{2,p}-minimizations for q ∈ [1, 2] and p ∈ (0, 2]. Its convergence is also
established uniformly. To adapt the algorithm to the large-scale case, a practical
iterative quadratic method is considered to solve the subproblems inexactly.
Experimental results on three data sets validate the collective performance of the
proposed scheme. The joint representation based classification is confirmed to
improve on the state-of-the-art methods in recognition rate and running time.
code support.